24 research outputs found

    Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs

    Full text link
    As the number of transistors integrated on a chip continues to increase, a growing challenge is accurately modeling performance in the early stages of processor design. Analytical models have been employed to rapidly search for higher-performance designs, and can provide insights that detailed simulators may not. This paper proposes techniques to predict the impact of pending cache hits, hardware prefetching, and realistic miss status holding register (MSHR) resources on superscalar performance in the presence of long-latency memory systems when employing hybrid analytical models that apply instruction trace analysis. Pending cache hits are secondary references to a cache block for which a request has already been initiated but has not yet completed. We find that pending hits resulting from spatial locality, and the fine-grained selection of instruction profile window blocks used for analysis, both have a non-negligible influence on the accuracy of hybrid analytical models, and we propose techniques to account for their effects. We then introduce techniques to estimate the performance impact of data prefetching by modeling the timeliness of prefetches, and to account for a limited number of MSHRs by restricting the size of profile window blocks. As with earlier hybrid analytical models, our approach is roughly two orders of magnitude faster than detailed simulation. When modeling pending hits for a processor with unlimited outstanding misses, we improve the accuracy of our baseline by a factor of 3.9, decreasing average error from 39.7% to 10.3%. When modeling a processor with data prefetching, a limited number of MSHRs, or both, the techniques result in an average error of 13.8%, 9.5%, and 17.8%, respectively.
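    A rough sketch of the trace-analysis idea described above, under assumptions not taken from the paper (a 64-byte cache block, a fixed memory latency, and a very crude stall estimate): the first reference to a block is a miss and occupies an MSHR, later references to the same in-flight block are pending hits, and the profile window is closed once the MSHR budget is exhausted. This is only an illustration of the classification step, not the authors' model.

```python
"""Illustrative sketch: classify misses vs. pending hits in an address trace,
with profile windows bounded by a hypothetical MSHR budget."""

BLOCK_BYTES = 64    # assumed cache-block size (illustrative)
MEM_LATENCY = 200   # assumed round-trip memory latency in cycles (illustrative)

def classify_trace(addresses, num_mshrs):
    """Walk a trace of load/store addresses, splitting references into misses
    (first touch of a block) and pending hits (later touches of a block that is
    still in flight).  A new profile window is opened whenever the MSHR budget
    would otherwise be exceeded."""
    in_flight = set()
    misses = pending_hits = 0
    windows = 1 if addresses else 0
    for addr in addresses:
        block = addr // BLOCK_BYTES
        if block in in_flight:
            pending_hits += 1              # secondary reference overlaps the outstanding miss
        else:
            if len(in_flight) == num_mshrs:
                in_flight.clear()          # MSHRs exhausted: close the window
                windows += 1
            in_flight.add(block)
            misses += 1
    return misses, pending_hits, windows

def estimate_stall_cycles(addresses, num_mshrs):
    """Crude estimate: misses inside one window are assumed to overlap, so each
    window is charged roughly one memory round trip; pending hits add no extra
    latency."""
    _, _, windows = classify_trace(addresses, num_mshrs)
    return windows * MEM_LATENCY

if __name__ == "__main__":
    trace = [0x00, 0x08, 0x40, 0x48, 0x80, 0x1000, 0x1008, 0x04]
    print(classify_trace(trace, num_mshrs=2))         # -> (5, 3, 3)
    print(estimate_stall_cycles(trace, num_mshrs=2))  # -> 600
```

    The point of the sketch is the accounting, not the numbers: pending hits are not charged a full memory latency, and a small MSHR count forces more, shorter windows, which is how a limited-MSHR machine loses memory-level parallelism.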

    Complexity effective memory access scheduling for many-core accelerator architectures

    Full text link
    Modern DRAM systems rely on memory controllers that employ out-of-order scheduling to maximize row access locality and bank-level parallelism, which in turn maximizes DRAM bandwidth. This is especially important in graphics processing unit (GPU) architectures, where the large quantity of parallelism places a heavy demand on the memory system. The logic needed for out-of-order scheduling can be expensive in terms of area, especially when compared to an in-order scheduling approach. In this paper, we propose a complexity-effective solution to DRAM request scheduling which recovers most of the performance loss incurred by a naive in-order first-in first-out (FIFO) DRAM scheduler compared to an aggressive out-of-order DRAM scheduler. We observe that the memory request stream from individual GPU "shader cores" tends to have sufficient row access locality to maximize DRAM efficiency in most applications without significant reordering. However, the interconnection network across which memory requests are sent from the shader cores to the DRAM controller tends to finely interleave the numerous memory request streams in a way that destroys the row access locality of the resultant stream seen at the DRAM controller. To address this, we employ an interconnection network arbitration scheme that preserves the row access locality of individual memory request streams and, in doing so, achieves DRAM efficiency and system performance close to that achievable with out-of-order memory request scheduling, using a simpler design. We evaluate our interconnection network arbitration scheme using crossbar, mesh, and ring networks for a baseline architecture with 8 memory channels, each controlled by its own DRAM controller, and 28 shader cores (224 ALUs), supporting up to 1,792 in-flight memory requests. Our results show that our interconnect arbitration scheme coupled with a banked FIFO in-order scheduler obtains up to 91% of the performance obtainable with an out-of-order memory scheduler for a crossbar network with eight-entry DRAM controller queues.
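    The arbitration idea can be illustrated with a small, hypothetical grant-holding arbiter: keep granting the input port whose next request continues the row currently being streamed toward the DRAM controller, and rotate round-robin only when that streak breaks. This is a simplification for illustration, not the paper's actual crossbar/mesh/ring arbitration scheme.

```python
"""Sketch of a grant-holding arbiter that preserves row access locality."""
from collections import deque

class RowPreservingArbiter:
    def __init__(self, num_inputs):
        self.queues = [deque() for _ in range(num_inputs)]  # per-core request queues
        self.last_granted = 0
        self.last_row = None        # (bank, row) of the last forwarded request

    def push(self, port, bank, row):
        self.queues[port].append((bank, row))

    def grant(self):
        """Pick one request to forward to the DRAM controller this cycle."""
        n = len(self.queues)
        # 1. Hold the grant if the same port continues the current row streak.
        held = self.queues[self.last_granted]
        if held and held[0] == self.last_row:
            self.last_row = held.popleft()
            return self.last_granted, self.last_row
        # 2. Otherwise fall back to round-robin over the remaining ports.
        for i in range(n):
            port = (self.last_granted + 1 + i) % n
            if self.queues[port]:
                self.last_granted = port
                self.last_row = self.queues[port].popleft()
                return port, self.last_row
        return None                 # nothing to send this cycle

if __name__ == "__main__":
    arb = RowPreservingArbiter(num_inputs=2)
    # Core 0 streams row 5 of bank 0; core 1 streams row 9 of bank 0.
    for _ in range(3):
        arb.push(0, bank=0, row=5)
        arb.push(1, bank=0, row=9)
    while (g := arb.grant()) is not None:
        print(g)   # requests leave in row-contiguous runs rather than interleaved
```

    The contrast with a plain round-robin crossbar allocator is the hold step: by not switching inputs while a row streak continues, the stream arriving at the DRAM controller keeps the row locality the individual cores already had, so even a simple in-order FIFO scheduler can exploit it.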

    Thread block compaction for efficient SIMT control flow

    No full text
    Manycore accelerators such as graphics processor units (GPUs) organize processing units into single-instruction, multiple-data "cores" to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep, dubbed single-instruction, multiple-thread (SIMT) by NVIDIA. While current GPUs employ a per-warp (or per-wavefront) stack to manage divergent control flow, this approach incurs decreased efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads (where such sharing may at first seem detrimental). In our proposal, warps within a thread block share a common block-wide stack for divergence handling. At a divergent branch, threads are compacted into new warps in hardware. Our simulation results show that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism, and 17% versus dynamic warp formation on a set of CUDA applications that suffer significantly from control flow divergence.
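    A toy illustration of the compaction step, assuming a 4-wide warp and keeping each thread in its home SIMT lane (so register accesses stay lane-aligned): given which side of a branch each thread in a block took, repack each side into as few warps as possible. This sketches the repacking idea only, not the paper's block-wide stack hardware.

```python
"""Sketch of warp compaction at a divergent branch within a thread block."""

WARP_SIZE = 4   # small warp size so the example output is readable

def compact(threads, outcomes):
    """threads: list of thread ids; outcomes: dict tid -> True/False (branch taken).

    Returns {True: [warp, ...], False: [warp, ...]} where each warp is a list of
    length WARP_SIZE whose slot i holds a thread whose home lane is i, or None
    for an idle lane."""
    compacted = {True: [], False: []}
    for side in (True, False):
        # Threads on this side, bucketed by home lane (tid % WARP_SIZE).
        lanes = [[t for t in threads if outcomes[t] == side and t % WARP_SIZE == lane]
                 for lane in range(WARP_SIZE)]
        # Greedily peel one thread per lane until every bucket is empty.
        while any(lanes):
            warp = [lanes[lane].pop(0) if lanes[lane] else None
                    for lane in range(WARP_SIZE)]
            compacted[side].append(warp)
    return compacted

if __name__ == "__main__":
    threads = list(range(16))                        # four 4-wide warps in one block
    outcomes = {t: t in (0, 5, 10, 15) for t in threads}
    result = compact(threads, outcomes)
    print("taken:    ", result[True])    # one warp [0, 5, 10, 15] instead of four
    print("not taken:", result[False])   # three warps cover the remaining threads
```

    In this example a per-warp stack would issue all four warps on both branch paths, while compacting across the block packs the four taken threads (one per warp, each in a different lane) into a single warp, which is the source of the efficiency gain the abstract describes.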

    Learning your limit

    No full text